Cross-study validation for the assessment of prediction algorithms
Abstract
MOTIVATION Numerous competing algorithms for prediction in high-dimensional settings have been developed in the statistical and machine-learning literature. Learning algorithms and the prediction models they generate are typically evaluated on the basis of cross-validation error estimates in a few exemplary datasets. However, in most applications, the ultimate goal of prediction modeling is to provide accurate predictions for independent samples obtained in different settings. Cross-validation within exemplary datasets may not adequately reflect performance in the broader application context.
METHODS We develop and implement a systematic approach to 'cross-study validation', to replace or supplement conventional cross-validation when evaluating high-dimensional prediction models in independent datasets. We illustrate it via simulations and in a collection of eight estrogen receptor-positive breast cancer microarray gene-expression datasets, where the objective is to predict distant metastasis-free survival (DMFS). We compute the C-index for all pairwise combinations of training and validation datasets. We evaluate several alternatives for summarizing the pairwise validation statistics, and compare these to conventional cross-validation.
RESULTS Our data-driven simulations and our application to survival prediction with eight breast cancer microarray datasets suggest that standard cross-validation produces inflated discrimination accuracy for all algorithms considered, when compared to cross-study validation. Furthermore, the ranking of learning algorithms differs between the two approaches, suggesting that algorithms performing best in cross-validation may be suboptimal when evaluated through independent validation.
AVAILABILITY The survHD: Survival in High Dimensions package (http://www.bitbucket.org/lwaldron/survhd) will be made available through Bioconductor.
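The pairwise scheme described under METHODS can be illustrated with a minimal sketch. It is not the survHD implementation: the simulated studies, the toy ridge learner `fit_toy`, the helper names, and the choice of a plain mean as the summary are all assumptions made here for illustration. Each off-diagonal entry Z[i, j] holds the C-index of a model trained on study i and validated on study j; each diagonal entry holds a conventional within-study cross-validation estimate for comparison.

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's C-index: share of usable pairs ranked concordantly by risk."""
    concordant, usable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a pair is usable only if the earlier time is an observed event
        for j in range(len(time)):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable if usable else float("nan")

def fit_toy(X, time, event, lam=1.0):
    """Hypothetical stand-in learner: ridge regression of -log(time), events only."""
    Xe, y = X[event], -np.log(time[event])
    Xc, yc = Xe - Xe.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)

def within_study_cv(X, time, event, fit, n_folds, rng):
    """Conventional k-fold cross-validated C-index inside a single study."""
    folds = np.array_split(rng.permutation(len(time)), n_folds)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(time)), held_out)
        beta = fit(X[train], time[train], event[train])
        scores.append(c_index(time[held_out], event[held_out], X[held_out] @ beta))
    return float(np.mean(scores))

def cross_study_matrix(studies, fit, n_folds=4, seed=0):
    """Z[i, j]: train on study i, validate on study j; diagonal is within-study CV."""
    rng = np.random.default_rng(seed)
    k = len(studies)
    Z = np.empty((k, k))
    for i, (Xi, ti, ei) in enumerate(studies):
        beta = fit(Xi, ti, ei)
        for j, (Xj, tj, ej) in enumerate(studies):
            if i == j:
                Z[i, j] = within_study_cv(Xi, ti, ei, fit, n_folds, rng)
            else:
                Z[i, j] = c_index(tj, ej, Xj @ beta)
    return Z

def simulate_study(rng, n, beta):
    """Assumed toy data: exponential survival times, study-specific effect drift."""
    b = beta + rng.normal(scale=0.2, size=beta.shape)  # between-study heterogeneity
    X = rng.normal(size=(n, beta.size))
    t = rng.exponential(np.exp(-(X @ b)))               # higher X @ b, shorter survival
    c = rng.exponential(2.0 * np.median(t), size=n)     # independent censoring times
    return X, np.minimum(t, c), t <= c

rng = np.random.default_rng(1)
true_beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
studies = [simulate_study(rng, 80, true_beta) for _ in range(3)]

Z = cross_study_matrix(studies, fit_toy)
off_diagonal = Z[~np.eye(len(studies), dtype=bool)]
csv_summary = float(off_diagonal.mean())  # one possible cross-study summary
cv_summary = float(np.diag(Z).mean())     # average within-study CV estimate
print(np.round(Z, 3))
print("cross-study mean:", round(csv_summary, 3), "within-study CV mean:", round(cv_summary, 3))
```

Averaging the off-diagonal entries is only one way to condense the matrix; the abstract mentions several alternative summaries without naming them, so any other statistic (for example a median, or a per-training-study average) could be substituted in the last few lines.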
Similar resources
Improving the Performance of Machine Learning Algorithms for Heart Disease Diagnosis by Optimizing Data and Features
The heart is one of the most important organs of the body, and heart disease is the major cause of death in the world and in Iran. This is why early, timely diagnosis is one of the key foundations for preventing and reducing deaths from this disease. So far, many studies have been done on heart disease with the aim of prediction, diagnosis, and treatment. However, most of them have been mostly f...
Determining optimal value of the shape parameter $c$ in RBF for unequal distances topographical points by Cross-Validation algorithm
Several radial basis function based methods contain a free shape parameter which has a crucial role in the accuracy of the methods. Performance evaluation of this parameter in different functions with various data has always been a topic of study. In the present paper, we consider studying the methods which determine an optimal value for the shape parameter in interpolations of radial basis ...
Designing an artificial neural network for joint prediction of metabolic syndrome and the insulin resistance index (HOMA-IR): Tehran Lipid and Glucose Study
Background & Objective: Mixed outcomes arise when, in a multivariate model, response variables are measured on different scales, such as binary and continuous. In bivariate modeling with mixed response variables, the common methods of classical statistics have shortcomings. This study aimed at designing an appropriate ANN model for modeling and predicting the bivariate mixed responses i...
A Novel LSSVM Based Algorithm to Increase Accuracy of Bacterial Growth Modeling
Background: Recent progress in advanced, accurate, and rigorously evaluated algorithms has revolutionized different aspects of predictive microbiology, including bacterial growth. Objectives: In this study, attempts were made to develop a more accurate hybrid algorithm for predicting the bacterial growth curve which can also be ...
Application of ensemble learning techniques to model the atmospheric concentration of SO2
For pollution prediction modeling, the study adopts homogeneous (random forest, bagging, and additive regression) and heterogeneous (voting) ensemble classifiers to predict the atmospheric concentration of sulphur dioxide. For model validation, results were compared against widely known single base classifiers such as support vector machine, multilayer perceptron, linear regression and re...